04_time.Rmd

This step of the BDC workflow extracts the collection year, whenever possible, from complete and legitimate date information, and flags records whose collecting year is dubious (e.g., 07/07/10), illegitimate (e.g., 1300 or 2100), or not supplied (e.g., 0 or NA).
Important:
The results of the VALIDATION tests used to flag data quality are appended to the database as separate fields and returned as TRUE or FALSE, where TRUE indicates a correct record and FALSE a potentially problematic or suspect record.
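As a minimal illustration of this convention (toy data and a hypothetical flag column, not output of the package), each test appends one logical column, and suspect records are those flagged FALSE:

```r
# Toy database with one hypothetical flag column appended by a VALIDATION test
database <- data.frame(
  scientificName   = c("Puma concolor", "Panthera onca"),
  .eventDate_empty = c(TRUE, FALSE)  # FALSE = event date missing, i.e., suspect
)

# Retrieve the records flagged as potentially problematic
suspect <- database[database$.eventDate_empty == FALSE, ]
```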
You can install the latest version of bdc from GitHub with:
if (!require("remotes")) install.packages("remotes")
if (!require("bdc")) remotes::install_github("brunobrr/bdc")

Creating folders to save the results.
bdc::bdc_create_dir()

Read the database created in the Space step of the BDC workflow. It is also possible to read any dataset containing the **required** fields to run the workflow (more details here).
database <-
  qs::qread("Output/Intermediate/03_space_database.qs")

Standardization of character encoding.
# Use [[i]] so the check also works when the database is a tibble
for (i in seq_len(ncol(database))) {
  if (is.character(database[[i]])) {
    Encoding(database[[i]]) <- "UTF-8"
  }
}

VALIDATION. This function flags records lacking event date information (e.g., empty or NA).
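The test itself reduces to checking for NA or empty strings; a rough base-R sketch with toy dates (not bdc's implementation):

```r
# TRUE = a usable (non-empty) verbatim event date; FALSE = missing or empty
dates <- c("1998-07-25", "", NA, "2001")
.eventDate_empty <- !is.na(dates) & trimws(dates) != ""
```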
check_time <-
  bdc_eventDate_empty(data = database, eventDate = "verbatimEventDate")
#>
#> bdc_eventDate_empty:
#> Flagged 3179 records.
#> One column was added to the database.

ENRICHMENT. This function extracts a four-digit year from unambiguously interpretable collecting dates.
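Conceptually, the extraction looks for a single, unambiguous four-digit year in the verbatim date; a rough base-R sketch with toy data (the regex and data are illustrative, not bdc's implementation):

```r
dates <- c("1998-07-25", "07/07/10", "2001")

# Match a plausible four-digit year (1500-2099) not embedded in a longer number
pos <- regexpr("(?<![0-9])(1[5-9][0-9]{2}|20[0-9]{2})(?![0-9])", dates, perl = TRUE)

year <- rep(NA_integer_, length(dates))
ok <- pos > 0
year[ok] <- as.integer(substring(dates[ok], pos[ok], pos[ok] + 3))
# "07/07/10" is ambiguous, so no year is extracted for it
```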
check_time <-
  bdc_year_from_eventDate(data = check_time, eventDate = "verbatimEventDate")
#>
#> bdc_year_from_eventDate:
#> Four-digit years were extracted from 2933 records.

VALIDATION. This function identifies records with an illegitimate or potentially imprecise collecting year. The year provided can be out of range (e.g., in the future) or earlier than a threshold supplied by the user (e.g., 1900). Older records are more likely to be imprecise because of the locality-based georeferencing process.
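Before running the bdc function, the range test itself can be sketched in base R (toy years; not bdc's implementation):

```r
years <- c(1998, 2100, 1850, NA)
year_threshold <- 1900
current_year <- as.integer(format(Sys.Date(), "%Y"))

# TRUE = plausible collecting year; FALSE = missing, in the future, or too old
.year_outOfRange <- !is.na(years) & years >= year_threshold & years <= current_year
```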
check_time <-
  bdc_year_outOfRange(data = check_time,
                      eventDate = "year",
                      year_threshold = 1900)
#>
#> bdc_year_outOfRange:
#> Flagged 12 records.
#> One column was added to the database.

Creating a column named .summary summarizing the results of all VALIDATION tests. This column is FALSE when a record is flagged as FALSE in any data quality test (i.e., a potentially invalid or suspect record).
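Conceptually, .summary is the logical AND of all test columns; a sketch with hypothetical flag columns (bdc_summary_col locates the flag columns automatically):

```r
flags <- data.frame(
  .eventDate_empty = c(TRUE, FALSE, TRUE),
  .year_outOfRange = c(TRUE, TRUE, FALSE)
)

# A FALSE in any test makes the record FALSE overall
flags$.summary <- Reduce(`&`, flags)
```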
check_time <- bdc_summary_col(data = check_time)
#> Column '.summary' already exist. It will be updated
#>
#> bdc_summary_col:
#> Flagged 3481 records.
#> One column was added to the database.

Creating a report summarizing the results of all tests of the BDC workflow.
report <-
  bdc_create_report(data = check_time,
                    database_id = "database_id",
                    workflow_step = "time")
#>
#> bdc_create_report:
#> Check the report summarizing the results of the time in:
#> Output/Report
report

Creating a histogram showing the number of records collected over the years.
bdc_create_figures(data = check_time,
                   database_id = "database_id",
                   workflow_step = "time")
#> Check figures in C:/Users/Bruno Ribeiro/Documents/bdc/vignettes/Output/Figures
Number of records sampled over the years

Summary of all tests of the time step; note that some databases lack event date information

Summary of all validation tests of the BDC workflow
Save the original database with the results of all data quality tests appended as separate columns.
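No save call is shown here; a hedged sketch, assuming the same Output/Intermediate path pattern as the earlier qs::qread call (the file name is an assumption, and the first line is a stand-in so the snippet runs on its own; in the workflow check_time already exists):

```r
# Stand-in for the flagged database, only so this snippet is self-contained
check_time <- data.frame(database_id = 1:3, .summary = c(TRUE, FALSE, TRUE))

dir.create("Output/Intermediate", recursive = TRUE, showWarnings = FALSE)

# Base-R save; with the qs package used above, the equivalent would be
# qs::qsave(check_time, "Output/Intermediate/04_time_database.qs")
saveRDS(check_time, "Output/Intermediate/04_time_database.rds")
```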
Let’s remove the potentially erroneous or suspect records flagged by the data quality tests applied in all steps of the BDC workflow to obtain a “clean”, “fitness-for-use” database. Note that 29% (2,631 out of 9,000 records) of the original records were considered “fitness-for-use” after the data-cleaning process.
output <-
  check_time %>%
  dplyr::filter(.summary == TRUE) %>%
  bdc_filter_out_flags(data = ., col_to_remove = "all")
#>
#> bdc_filter_out_flags:
#> The following columns were removed from the database:
#> .uncer_terms, .val, .equ, .zer, .cap, .cen, .urb, .otl, .gbf, .inst, .dpl, .rou, .eventDate_empty, .year_outOfRange, .summary